Abstract

The position we advocate in this paper is that relational
algebra can provide a unified language for both representing
and computing with statistical-relational objects, much as
linear algebra does for traditional single-table machine learning.
Relational algebra is implemented in the Structured Query
Language (SQL), which is the basis of relational database
management systems. To support our position, we have
developed the FactorBase system, which uses SQL as a
high-level scripting language for statistical-relational learning of
a graphical model structure. The design philosophy of
FactorBase is to manage statistical models as first-class
citizens inside a database. Our implementation shows how our
SQL constructs in FactorBase facilitate fast, modular, and
reliable program development. Empirical evidence from six
benchmark databases indicates that leveraging database
system capabilities achieves scalable model structure learning.


Introduction 

The statistical analysis of structured data requires structured
machine learning models. Database researchers have
developed a system architecture where statistical models are
stored as first-class citizens inside a relational database
management system (RDBMS) (Wang et al. 2008; Niu et al.
2011). The goal is to seamlessly integrate query processing
and statistical-relational inference inside the database,
rather than invoking external procedures. These systems
focus on inference given a statistical-relational model, not on
learning the model from the data stored in the RDBMS. We
describe the FactorBase system, which complements the
in-database probabilistic inference systems with an in-database
probabilistic model learning system. The name FactorBase
indicates that our system supports learning a set of
(par)-factors for a log-linear multi-relational model,
typically represented in a graphical model (Kimmig, Mihalkova,
and Getoor 2015).

There are several previous systems that leverage RDBMS
support for learning (Hellerstein et al. 2012; Kraska et al.
2013; Deshpande and Madden 2006), but they apply to
traditional learning where the data are represented in a
single table or data matrix. The novel contribution of
FactorBase is supporting graphical model learning for
multi-relational data stored in different interrelated tables. The
Sindbad system (Wicker, Richter, and Kramer 2010)
provides support for some multi-relational knowledge
discovery tasks in an inductive database, but not for graphical
model construction. Multi-relational graphical model
construction raises several special challenges, such as:

1. A description language for specifying metadata about
structured random variables.

2. Efficient mechanisms for constructing, storing, and
transforming complex statistical objects, such as cross-table
sufficient statistics, parameter estimates, and model
selection scores.

3. Computing model prediction scores for relational test
instances.

FactorBase applies SQL as a scripting language to
implement database services that provide these capabilities.
While FactorBase provides a good solution for each of 
these system capabilities in isolation, the ease with which 
large complex statistical-relational objects can be integrated 
via SQL queries is a key feature. 


Evaluation 

Our system is fully implemented and source code is
available on-line (Qian and Schulte 2015). We
summarize the evaluation of FactorBase on six benchmark
databases. For each benchmark database, the system
applies the learn-and-join algorithm, a state-of-the-art SRL
algorithm that constructs a statistical-relational Bayesian
network model (Schulte and Khosravi 2012). The learned
Bayes net structure can be converted to a Markov Logic
network structure or a set of clauses (Khosravi et al. 2010). The
same SQL scripts work for all benchmark databases, which
demonstrates the generality of our approach.

Our experiments show that FactorBase pushes the 
scalability boundary: Learning scales to databases with over 
10^ records, compared to less than 10^ for previous systems. 
At the same time it is able to discover more complex
cross-table correlations than previous SRL systems. The
scalability improvement is mainly due to the efficient computation
and caching of sufficient statistics supported by SQL. Our
experiments focus on two key services for an SRL client: (1)
computing and caching sufficient statistics, (2) computing
model predictions on test instances. The system can handle
as many as 15M sufficient statistics. SQL facilitates
block-prediction for a set of test instances, which leads to a 10 to
100-fold speedup compared to a simple loop over test
instances.

Figure 1: System Flow. All statistical objects are stored as first-class citizens in a DBMS. Objects on the left of an arrow are
utilized for constructing objects on the right. Statistical objects are constructed and managed by different modules (boxes).
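
The difference between block-prediction and a per-instance loop, as reported in the Evaluation section, can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in RDBMS. The tables `cpt` and `test_registered`, their columns, and the toy probabilities are hypothetical and are not taken from FactorBase; the sketch only shows that one grouped join returns the same scores as one query per test instance.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite builds do not always ship math functions, so register ln() explicitly.
conn.create_function("ln", 1, math.log)
conn.executescript("""
-- Hypothetical learned parameters: P(grade | difficulty).
CREATE TABLE cpt (difficulty TEXT, grade TEXT, cp REAL);
INSERT INTO cpt VALUES
    ('hard', 'A', 0.5), ('hard', 'B', 0.5),
    ('easy', 'A', 0.75), ('easy', 'B', 0.25);
-- Hypothetical test instances: observed grades per student.
CREATE TABLE test_registered (s_id INTEGER, difficulty TEXT, grade TEXT);
INSERT INTO test_registered VALUES
    (1, 'hard', 'A'), (1, 'easy', 'A'),
    (2, 'hard', 'B'), (2, 'easy', 'B'),
    (3, 'easy', 'A');
""")


def score_by_loop(conn):
    """One query per test instance: the simple loop."""
    ids = [row[0] for row in conn.execute(
        "SELECT DISTINCT s_id FROM test_registered ORDER BY s_id")]
    scores = {}
    for s_id in ids:
        row = conn.execute(
            """SELECT SUM(ln(p.cp))
               FROM test_registered AS t
               JOIN cpt AS p
                 ON p.difficulty = t.difficulty AND p.grade = t.grade
               WHERE t.s_id = ?""", (s_id,)).fetchone()
        scores[s_id] = row[0]
    return scores


def score_by_block(conn):
    """One grouped join that scores every test instance at once."""
    rows = conn.execute(
        """SELECT t.s_id, SUM(ln(p.cp))
           FROM test_registered AS t
           JOIN cpt AS p
             ON p.difficulty = t.difficulty AND p.grade = t.grade
           GROUP BY t.s_id""").fetchall()
    return dict(rows)


loop_scores = score_by_loop(conn)
block_scores = score_by_block(conn)
```

Both functions compute a log-probability score per student; the block version leaves the iteration to the query engine, which is where a server-side RDBMS can apply its join and aggregation optimizations.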


Benefits 

We advocate using SQL as a high-level scripting language
for SRL, because of the following advantages.

1. Extensibility and modularity, which support rapid
prototyping. SRL algorithm development can focus on
statistical issues and rely on the RDBMS for data access and
processing.

2. Increased scalability, in terms of both the size and the
complexity of the statistical objects that can be handled.

3. Generality and portability: standardized database
operations support “out-of-the-box” learning with a minimal
need for user configuration.


System Overview 

Our system design represents statistical objects as relational 
tables, on a par with the original data tables, so that SQL 
can be used to manage them. Figure 1 represents key system
components. The starting point is a multi-relational database 
containing the input data. 


System Components 

The Schema Analyzer The schema analyzer is an SQL
script that queries the system catalog table to define a default
set of relational random variables (par-RVs) for statistical
analysis (Kimmig, Mihalkova, and Getoor 2015). The
metadata include the domain of the par-RVs (possible values),
and type information (possible arguments). The schema
analyzer extracts metadata about the random variables from the
database system catalog. The random variables and
associated metadata are stored in the random variable database
VDB. We highlight two features of the VDB component.

(i) The set of par-RVs and the associated metadata is
constructed automatically from the input database. Thus
FactorBase utilizes the data description resources of SQL to
facilitate the “setup task” for relational learning (Walker et
al. 2010).

(ii) Representing metadata explicitly offers two
advantages. First, a user can easily edit the VDB to customize the
learning behavior, for instance by deleting irrelevant
par-RVs. Second, it is possible to export metadata from other
formats to the VDB format. In this way FactorBase can
serve as a structure learning backend to expressive
specification languages for other relational models (Guazzelli et al.
2009; Milch et al. 2005).
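
As a minimal sketch of the idea behind the schema analyzer, the following Python script uses the built-in sqlite3 module to read a database catalog and derive one candidate par-RV per non-key column. The university-style schema, the `vdb_par_rv` table, and its columns are assumptions made for illustration; the actual FactorBase scripts and VDB layout are not shown in this paper and may differ.

```python
import sqlite3

# Toy input database; the schema and all names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (s_id INTEGER PRIMARY KEY, intelligence TEXT);
CREATE TABLE Course (c_id INTEGER PRIMARY KEY, difficulty TEXT);
CREATE TABLE Registered (
    s_id INTEGER REFERENCES Student(s_id),
    c_id INTEGER REFERENCES Course(c_id),
    grade TEXT,
    PRIMARY KEY (s_id, c_id)
);
INSERT INTO Student VALUES (1, 'high'), (2, 'low');
INSERT INTO Course VALUES (10, 'hard'), (11, 'easy');
INSERT INTO Registered VALUES (1, 10, 'A'), (2, 10, 'B'), (2, 11, 'A');
""")


def analyze_schema(conn):
    """Derive one candidate par-RV per non-key column from the catalog."""
    par_rvs = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        key_columns = [col[1] for col in columns if col[5] > 0]
        for col in columns:
            if col[5] > 0:
                continue  # key columns act as arguments, not as par-RVs
            domain = [row[0] for row in conn.execute(
                f'SELECT DISTINCT "{col[1]}" FROM "{table}" ORDER BY 1')]
            par_rvs.append((col[1], table,
                            ",".join(key_columns),
                            ",".join(str(v) for v in domain)))
    return par_rvs


par_rvs = analyze_schema(conn)

# Store the metadata as an ordinary table so it can be queried and edited.
conn.execute("""CREATE TABLE vdb_par_rv
                (name TEXT, source_table TEXT, arguments TEXT, domain TEXT)""")
conn.executemany("INSERT INTO vdb_par_rv VALUES (?, ?, ?, ?)", par_rvs)

# A user customizes learning by deleting a par-RV considered irrelevant.
conn.execute("DELETE FROM vdb_par_rv WHERE name = 'difficulty'")
```

Because the metadata live in a plain table, customizing the learning setup reduces to ordinary SQL statements such as the final `DELETE`.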


The Count Manager A key service for statistical-relational
learning is counting how many times a given
relational pattern (par-RV) is instantiated in the data. Such
counts are known as sufficient statistics. Accessing sufficient
statistics is often the main scalability bottleneck. The access
patterns of a model search procedure are inherently
sequential and random (Niu et al. 2011), and therefore it is
important to cache sufficient statistics. Caching is even more
important if the data is stored on disk in an RDBMS, rather than
in main memory. There are several reasons for employing an
RDBMS for gathering sufficient statistics. (1) The machine
learning application saves expensive data transfer by
executing count operations in the database server space rather
than local main memory. (2) SQL optimizations for
aggregate functions such as SUM and COUNT can be leveraged.
(3) Sufficient statistics can be stored in the RDBMS. For
many datasets, the number of sufficient statistics runs in the
millions and is too big for main memory. A novel aspect of
FactorBase is managing multi-relational sufficient
statistics that combine information across different tables in the
relational database. This requires combining SQL aggregate
functions with table joins (Qian, Schulte, and Sun 2014).
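
The combination of aggregate functions with table joins can be sketched as follows, again using Python's sqlite3 module as a stand-in RDBMS. The toy schema and the cache table `ct_registered` are hypothetical names chosen for illustration. The sketch counts only patterns over existing relationship tuples; it does not reproduce the algorithm of Qian, Schulte, and Sun (2014).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Toy multi-relational input data (illustrative assumption).
CREATE TABLE Student (s_id INTEGER PRIMARY KEY, intelligence TEXT);
CREATE TABLE Course (c_id INTEGER PRIMARY KEY, difficulty TEXT);
CREATE TABLE Registered (s_id INTEGER, c_id INTEGER, grade TEXT,
                         PRIMARY KEY (s_id, c_id));
INSERT INTO Student VALUES (1, 'high'), (2, 'high'), (3, 'low');
INSERT INTO Course VALUES (10, 'hard'), (11, 'easy');
INSERT INTO Registered VALUES
    (1, 10, 'A'), (2, 10, 'A'), (3, 10, 'B'), (3, 11, 'A'), (1, 11, 'A');

-- Cross-table sufficient statistics: join the tables, then count each
-- joint value assignment. The result is cached as an ordinary table.
CREATE TABLE ct_registered AS
SELECT s.intelligence AS intelligence,
       c.difficulty   AS difficulty,
       r.grade        AS grade,
       COUNT(*)       AS cnt
FROM Registered AS r
JOIN Student AS s ON s.s_id = r.s_id
JOIN Course  AS c ON c.c_id = r.c_id
GROUP BY s.intelligence, c.difficulty, r.grade;
""")


def cached_count(conn, **pattern):
    """Sum cached counts that match a partial value assignment.

    Column names come from this script, not from user input, so building
    the WHERE clause by string formatting is acceptable in this sketch.
    """
    where = " AND ".join(f"{col} = ?" for col in pattern) or "1 = 1"
    row = conn.execute(
        f"SELECT COALESCE(SUM(cnt), 0) FROM ct_registered WHERE {where}",
        tuple(pattern.values())).fetchone()
    return row[0]


total = cached_count(conn)
high_hard_a = cached_count(conn, intelligence="high", difficulty="hard", grade="A")
hard_courses = cached_count(conn, difficulty="hard")
```

Once the contingency table is cached, a model search can answer marginal count queries such as `cached_count(conn, difficulty="hard")` from the cache without repeating the joins over the raw data.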


The Model Manager The Model Manager supports the
construction and querying of large structured statistical
models, which are stored in the Model Database MDB.
Services provided by the Model Manager include the
following. (1) Computing parameter estimates for the model
using the sufficient statistics in the Count Database. (2)
Computing model characteristics such as the number of
parameters or degrees of freedom in a model. (3) Computing a
model selection score that quantifies how well the model fits
the multi-relational data. Model selection scores are usually
functions of the number of parameters and the parameter
estimates. By employing the SQL view mechanism, parameter
estimates and model selection scores are updated
automatically during the model search.
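
The role of the view mechanism can be illustrated with a small sketch in Python's sqlite3 module. The count table `ct`, the views `cpt` and `model_score`, and the AIC-style score (log-likelihood minus number of parameters) are assumptions chosen for illustration; the paper does not specify which tables or scores FactorBase uses. The parameter count here is derived from the observed value combinations only.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register ln() before creating views that use it; SQLite builds do not
# always include math functions.
conn.create_function("ln", 1, math.log)
conn.executescript("""
-- Hypothetical sufficient statistics for the family grade <- difficulty.
CREATE TABLE ct (difficulty TEXT, grade TEXT, cnt INTEGER);
INSERT INTO ct VALUES
    ('hard', 'A', 2), ('hard', 'B', 2), ('easy', 'A', 3), ('easy', 'B', 1);

-- Parameter estimates: count divided by the total for the parent value.
CREATE VIEW cpt AS
SELECT c.difficulty AS difficulty, c.grade AS grade, c.cnt AS cnt,
       1.0 * c.cnt / p.parent_cnt AS cp
FROM ct AS c
JOIN (SELECT difficulty, SUM(cnt) AS parent_cnt
      FROM ct GROUP BY difficulty) AS p
  ON p.difficulty = c.difficulty;

-- Model selection score built from the estimates and the parameter count.
CREATE VIEW model_score AS
SELECT l.loglik AS loglik,
       k.n_params AS n_params,
       l.loglik - k.n_params AS aic_score
FROM (SELECT SUM(cnt * ln(cp)) AS loglik FROM cpt) AS l
CROSS JOIN
     (SELECT SUM(n_values - 1) AS n_params
      FROM (SELECT COUNT(*) AS n_values
            FROM ct GROUP BY difficulty) AS per_parent) AS k;
""")


def read_score(conn):
    """Fetch (loglik, n_params, aic_score) from the score view."""
    return conn.execute(
        "SELECT loglik, n_params, aic_score FROM model_score").fetchone()


score_before = read_score(conn)

# Changing the underlying counts changes what the views return; no
# explicit recomputation step is needed.
conn.execute("UPDATE ct SET cnt = 4 WHERE difficulty = 'hard' AND grade = 'A'")
score_after = read_score(conn)
cp_hard_a = conn.execute(
    "SELECT cp FROM cpt WHERE difficulty = 'hard' AND grade = 'A'").fetchone()[0]
```

Because `cpt` and `model_score` are views, they are evaluated against the current contents of `ct` each time they are queried, which is the sense in which estimates and scores stay up to date during search.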